Intro to R

Larisa M. Soto

2022-09-13

This workshop is a beginner-level introduction to programming in R. It is designed to be taught in two 3-hour sessions and focuses on applying R to the analysis of tabular data from clinical trials.

1 The basics

Learning objectives

  • Become familiar with the language and the logic behind it
  • Create a project in RStudio
  • Configure the working directory
  • Create your first R script
  • Get fluent in R using the console
  • Compute arithmetic operations
  • Use logical operators on variables
  • Learn how to ask for help
  • Get comfortable installing packages

1.1 Installing packages

Packages can be installed from several sources. The most common ones are CRAN, Bioconductor and GitHub.

CRAN

install.packages(c("dplyr","ggplot2","gapminder","medicaldata"))
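To avoid reinstalling packages that are already present, one option is to check first which ones are missing. The following is a minimal sketch and not part of the original workshop material; the helper name find_missing is hypothetical.

```r
# Hypothetical helper: return the packages in `pkgs` that are not installed yet
find_missing <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- find_missing(c("dplyr", "ggplot2", "gapminder", "medicaldata"))
to_install

# Install only what is missing (uncomment to run; requires internet access)
# if (length(to_install) > 0) install.packages(to_install)
```

If to_install is empty, all four packages are already available and nothing needs to be installed.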

Bioconductor

For more details about the project you can visit https://www.bioconductor.org

To install packages from Bioconductor you first need to install its package manager, BiocManager.

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager", repos = "https://stat.ethz.ch/CRAN/")
BiocManager::install(version = "3.15")

Then you can install any Bioconductor package with BiocManager::install()

BiocManager::install("DESeq2")

GitHub

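If you want to install the development version of a package, or a package that is only available on GitHub, you can use the install_github() function from the devtools package

# install.packages("devtools") # Run this first if devtools is not installed
devtools::install_github('andreacirilloac/updateR')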

1.2 Syntax

Comments

# This is a comment line 

Accessing content

letters[1]
## [1] "a"
letters[2]
## [1] "b"
head(iris$Sepal.Length)
## [1] 5.1 4.9 4.7 4.6 5.0 5.4

1.3 Arithmetic Operations

# Addition
2+2
## [1] 4
# Subtraction
3-5
## [1] -2
# Multiplication
71*9
## [1] 639
# Division
90/3
## [1] 30
# Power
2^3
## [1] 8
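Two further operators are often useful, and the modulo operator is used later in the indexing section. The short sketch below also illustrates operator precedence; the numbers are arbitrary.

```r
17 %/% 5      # Integer division: 3
17 %% 5       # Remainder (modulo): 2
-2^2          # -4, because ^ is evaluated before the unary minus
(-2)^2        # 4, parentheses change the order of evaluation
(2 + 3) * 4   # 20
```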

1.4 Creating variables

# The convention is to use the left assignment operator '<-'
var1 <- 12
var2 <- "hello world"
var1
## [1] 12
var2
## [1] "hello world"
# It is also possible to use the '=' sign, but it is not considered good practice
var1 = 12
var2 = "hello world"
var1
## [1] 12
var2
## [1] "hello world"

1.5 Logical operators

# First create two numeric variables
var1 <- 35
var2 <- 27
# Equal to
var1 == var2
## [1] FALSE
# Less than or equal to
var1 <= var2
## [1] FALSE
# They also work with other classes
var1 <- "mango"
var2 <- "mangos"
var1 == var2
## [1] FALSE

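Strings are compared character by character, following the sort order of the current locale, until two characters differ or one of the strings runs out of characters. In the latter case the shorter string is considered smaller.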

var1 < var2
## [1] TRUE

We can test if a variable is contained in another object

"c" %in% letters
## [1] TRUE
"c" %in% LETTERS
## [1] FALSE
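Comparisons can be combined with the operators & (and), | (or) and ! (not). The following is a small sketch; the variable names age and treated are hypothetical.

```r
age <- 42          # hypothetical patient age
treated <- TRUE    # hypothetical treatment flag

age >= 18 & age < 65           # TRUE: both conditions hold
age < 18 | age >= 65           # FALSE: neither condition holds
!treated                       # FALSE: negation
c(TRUE, FALSE) & c(TRUE, TRUE) # TRUE FALSE: the operators work element by element
```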

1.6 Seeking help

Open the help page of a function, for example the concatenate function c()

?c

Print the structure of an object

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
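A few other base R functions can help when you do not remember the exact name or the arguments of a function. This is only a small selection, shown as a sketch.

```r
args(sd)              # Show the arguments that a function accepts
apropos("^read\\.")   # List objects whose names start with "read."

# The calls below open help pages or run examples, so they are left commented out
# help.search("standard deviation")   # Search the help pages of installed packages
# example(mean)                        # Run the examples from a help page
```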

2 Data types and data structures

Learning objectives

  • Understand the differences between classes, objects and data types in R
  • Create objects of different types
  • Subset and index objects
  • Learn and use vectorized operations

2.1 Vectors

Key points:
- Can only contain objects of the same class
- Most basic type of R object
- Variables are vectors

2.1.1 Numeric

Numeric vectors store numbers as doubles, that is, numbers with a decimal component. The term double refers to double-precision floating point: each value occupies 8 bytes, twice the size of a single-precision number. Each double is accurate to about 15 to 16 significant digits.
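One practical consequence of this limited precision is that exact comparisons between decimal numbers can give surprising results. A minimal sketch, using all.equal() to compare within a small tolerance:

```r
0.1 + 0.2 == 0.3                    # FALSE: the sum is not exactly 0.3
print(0.1 + 0.2, digits = 17)       # 0.30000000000000004
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: equal within a small tolerance
```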

Creating a numeric vector using c()

x <- c(0.3, 0.1)
x
## [1] 0.3 0.1

Using the vector() function

x <- vector(mode = "numeric",length = 10)
x
##  [1] 0 0 0 0 0 0 0 0 0 0

Using the numeric() function

x <- numeric(length = 10)
x
##  [1] 0 0 0 0 0 0 0 0 0 0

Creating a numeric vector with a sequence of numbers

x <- seq(1,10,1)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- seq(1,10,2)
x
## [1] 1 3 5 7 9
x <- rep(2,10)
x
##  [1] 2 2 2 2 2 2 2 2 2 2

2.1.2 Integer

They store numbers that can be written without a decimal component.

Creating an integer vector using c()

x <- c(1L,2L,3L,4L,5L)  
x
## [1] 1 2 3 4 5

Creating an integer vector from a sequence of numbers

x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10

2.1.3 Logical

Creating a logical vector with c()

x <- c(TRUE,FALSE,T,F)
x
## [1]  TRUE FALSE  TRUE FALSE

Creating a logical vector with vector()

x <- vector(mode = "logical",length = 5)
x
## [1] FALSE FALSE FALSE FALSE FALSE

Creating a logical vector using logical()

x <- logical(length = 10)
x
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
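In arithmetic, TRUE counts as 1 and FALSE as 0, which makes it easy to count or summarise logical vectors. A short sketch with a hypothetical vector called responded:

```r
responded <- c(TRUE, FALSE, TRUE, TRUE)  # hypothetical data

sum(responded)    # 3: number of TRUE values
mean(responded)   # 0.75: proportion of TRUE values
any(responded)    # TRUE: at least one value is TRUE
all(responded)    # FALSE: not every value is TRUE
```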

2.1.4 Character

x<-c("a","b","c")
x
## [1] "a" "b" "c"
x<-vector(mode = "character",length=10)
x
##  [1] "" "" "" "" "" "" "" "" "" ""
x<-character(length = 3)
x
## [1] "" "" ""

Some useful functions to modify strings

tolower(LETTERS)
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
toupper(letters)
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
paste(letters,1:length(letters),sep="_") # Note the implicit coercion
##  [1] "a_1"  "b_2"  "c_3"  "d_4"  "e_5"  "f_6"  "g_7"  "h_8"  "i_9"  "j_10"
## [11] "k_11" "l_12" "m_13" "n_14" "o_15" "p_16" "q_17" "r_18" "s_19" "t_20"
## [21] "u_21" "v_22" "w_23" "x_24" "y_25" "z_26"
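Some other base R functions that are commonly used with strings are sketched below. The value stored in drug is only an example.

```r
drug <- "Streptomycin"         # hypothetical example value

nchar(drug)                    # 12: number of characters
substr(drug, 1, 5)             # "Strep": characters 1 to 5
sub("mycin", "MYCIN", drug)    # "StreptoMYCIN": replace the first match
paste0("patient_", 1:3)        # "patient_1" "patient_2" "patient_3"
```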

2.1.5 Vector attributes

The elements of a vector can have names

x<-1:5
names(x)<-c("one","two","three","four","five")
x
##   one   two three  four  five 
##     1     2     3     4     5
x<-logical(length = 4)
names(x)<-c("F1","F2","F3","F4")
x
##    F1    F2    F3    F4 
## FALSE FALSE FALSE FALSE

2.1.6 Built-in functions

To inspect the contents of a vector

is.vector(x) # Check if it is a vector
## [1] TRUE
is.na(x) # Check each element for missing values (NA)
##    F1    F2    F3    F4 
## FALSE FALSE FALSE FALSE
is.null(x) # Check if it is NULL
## [1] FALSE
is.numeric(x) # Check if it is numeric
## [1] FALSE
is.logical(x) # Check if it is logical
## [1] TRUE
is.character(x) # Check if it is character
## [1] FALSE

To know what kind of vector you are working with

class(x) # Class of the object (for a plain vector, the type of its elements)
## [1] "logical"
typeof(x) # Internal storage type (logical, integer, double, character, list...)
## [1] "logical"
str(x)
##  Named logi [1:4] FALSE FALSE FALSE FALSE
##  - attr(*, "names")= chr [1:4] "F1" "F2" "F3" "F4"

To know more about the data contained in the vector

length(x)
## [1] 4
table(x)
## x
## FALSE 
##     4
summary(x)
##    Mode   FALSE 
## logical       4

Mathematical operations

sum(x)
## [1] 0
min(x)
## [1] 0
max(x)
## [1] 0
mean(x)
## [1] 0
median(x)
## [1] 0
sd(x)
## [1] 0
log(x)
##   F1   F2   F3   F4 
## -Inf -Inf -Inf -Inf
exp(x)
## F1 F2 F3 F4 
##  1  1  1  1
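Real data often contain missing values, and most of these functions return NA when any element is missing. The sketch below uses a hypothetical vector of measurements and the na.rm argument to ignore missing values.

```r
weights <- c(61.5, NA, 72.0, 68.5)   # hypothetical measurements with one missing value

mean(weights)                 # NA: the missing value propagates
mean(weights, na.rm = TRUE)   # 67.33333: missing values are removed first
sum(is.na(weights))           # 1: number of missing values
```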

2.1.7 Vector arithmetic

x<-1:10
y<-11:20
x*2
##  [1]  2  4  6  8 10 12 14 16 18 20
x+y
##  [1] 12 14 16 18 20 22 24 26 28 30
x*y
##  [1]  11  24  39  56  75  96 119 144 171 200
x^y
##  [1] 1.000000e+00 4.096000e+03 1.594323e+06 2.684355e+08 3.051758e+10
##  [6] 2.821110e+12 2.326305e+14 1.801440e+16 1.350852e+18 1.000000e+20

2.1.8 Recycling

x<-1:10
y<-c(1,2)
x+y
##  [1]  2  4  4  6  6  8  8 10 10 12
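In the example above the shorter vector is repeated until it matches the length of the longer one. The sketch below shows what happens when the longer length is not a multiple of the shorter one: R still computes a result but issues a warning.

```r
x <- 1:10

x + c(1, 2)      # c(1, 2) is repeated five times
x + c(1, 2, 3)   # Still computed, but R warns that 10 is not a multiple of 3
```

Silent recycling is a common source of bugs, so it is worth checking the lengths of the vectors involved when a result looks odd.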

2.1.8.1 Exercise

Calculate the sum of the following sequence of fractions:

x = 1/(1^2) + 1/(2^2) + 1/(3^2) + ... + 1/(n^2)

# n=100
sum(1/(1:100)^2)
## [1] 1.634984
# n=10000
sum(1/(1:10000)^2)
## [1] 1.644834

2.1.9 Indexing and subsetting

For this example, let's create a vector of 15 random numbers between 1 and 100.

x<-sample(x = 1:100,size = 15,replace = F) 
x
##  [1] 79 12 28  6 90 23 96 63 67 89 64 51 76 84 46

Using the index/position

x[1] # Get the first element
## [1] 79
x[13] # Get the thirteenth element
## [1] 76

Using a vector of indices

x[1:12] # The first 12 numbers
##  [1] 79 12 28  6 90 23 96 63 67 89 64 51
x[c(1,5,6,8,9,13)] # Specific positions only
## [1] 79 90 23 63 67 76

Using a logical vector

# Only numbers that are less than or equal to 10
x<=10
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE
x[x<=10] 
## [1] 6
# Only even numbers 
x%%2 == 0
##  [1] FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE
## [13]  TRUE  TRUE  TRUE
x[x%%2 == 0]
## [1] 12 28  6 90 96 64 76 84 46

Skipping elements using indices

x[c(-1, -5)]
##  [1] 12 28  6 23 96 63 67 89 64 51 76 84 46

Skipping elements using names

x<-1:10
names(x)<-letters[1:10]
x[names(x) != "a"]
##  b  c  d  e  f  g  h  i  j 
##  2  3  4  5  6  7  8  9 10
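Names can also be used to select elements directly, and which() returns the positions that satisfy a condition. A short sketch using the same named vector:

```r
x <- 1:10
names(x) <- letters[1:10]

x[c("b", "d")]                    # Select elements by name
which(x > 7)                      # Positions of the elements greater than 7
x[!names(x) %in% c("a", "j")]     # Skip several elements by name
```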

2.1.9.1 Exercise

Find all the odd numbers in x

2.2 Lists

Key points:
- Can contain objects of multiple classes
- Extremely powerful when combined with some R built-in functions

Creating lists with different data types

l <- list(10, "hello", TRUE)
l
## [[1]]
## [1] 10
## 
## [[2]]
## [1] "hello"
## 
## [[3]]
## [1] TRUE

Assigning names as we create the list

l<-list(title = "Numbers", 
        numbers = 1:10, 
        logic = TRUE )
l
## $title
## [1] "Numbers"
## 
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $logic
## [1] TRUE
names(l)
## [1] "title"   "numbers" "logic"

2.2.1 Indexing and subsetting

Using [[]] instead of []

l[[1]]
## [1] "Numbers"

Using $ for named lists

l$logic
## [1] TRUE
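The difference between the two kinds of brackets is easier to see by checking what each one returns. The sketch below recreates the named list from above.

```r
l <- list(title = "Numbers", numbers = 1:10, logic = TRUE)

class(l["numbers"])      # "list": single brackets return a smaller list
class(l[["numbers"]])    # "integer": double brackets return the element itself
l[["numbers"]][1:3]      # 1 2 3: the extracted vector can be indexed further
l[c("title", "logic")]   # Single brackets can select several elements at once
```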

2.2.2 Built-in functions

l<-list(sample(1:100,10),
        sample(1:100,10),
        sample(1:100,10))
names(l)<-c("r1","r2","r3")

Performing operations on all elements of the list using lapply

lsums<-lapply(l,sum)
lsums
## $r1
## [1] 513
## 
## $r2
## [1] 415
## 
## $r3
## [1] 603
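lapply always returns a list. When every result is a single value, a vector is often more convenient; sapply and vapply are two base R alternatives. In the sketch below the seed is arbitrary and only makes the random numbers reproducible.

```r
set.seed(1)  # arbitrary seed
l <- list(r1 = sample(1:100, 10),
          r2 = sample(1:100, 10),
          r3 = sample(1:100, 10))

sapply(l, sum)                # Named vector instead of a list
vapply(l, mean, numeric(1))   # Like sapply, but the type of each result is declared
unlist(lapply(l, sum))        # Flattening the list gives the same result as sapply
```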

2.3 Factors

Key points:

  • Useful for categorical data
  • Can have implicit order, if needed
  • Each element has a label or level
  • They are important in statistical modelling and plotting with ggplot
  • Some operations behave differently on factors

Creating factors with factor

cols<-factor(x = c(rep("red",4),rep("blue",5),rep("green",2)),
             levels = c("red","blue","green"))
cols
##  [1] red   red   red   red   blue  blue  blue  blue  blue  green green
## Levels: red blue green
samples <- c("case", "control", "control", "case")
samples
## [1] "case"    "control" "control" "case"
samples_factor <- factor(samples, levels = c("control", "case"))
samples_factor
## [1] case    control control case   
## Levels: control case
str(samples_factor)
##  Factor w/ 2 levels "control","case": 2 1 1 2
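Internally a factor stores integer codes that point to its levels, as the output of str() suggests. The sketch below inspects those pieces and also shows an ordered factor; the dose labels are hypothetical.

```r
samples_factor <- factor(c("case", "control", "control", "case"),
                         levels = c("control", "case"))

levels(samples_factor)       # "control" "case"
as.integer(samples_factor)   # 2 1 1 2: the integer codes
table(samples_factor)        # Number of elements per level

# Ordered factor: the levels have a meaningful order and can be compared
dose <- factor(c("low", "high", "medium"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
dose < "high"                # TRUE FALSE TRUE
```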

2.3.1 Built-in functions

Grouping elements in a vector using tapply

measurements<-sample(1:1000,6)
samples<-factor(c(rep("case",3),rep("control",3)), levels = c("control", "case"))
tapply(measurements, samples, mean)
##  control     case 
## 577.3333 913.6667

2.4 Matrices

Creating a matrix full of zeros with matrix()

m<-matrix(0, ncol=6, nrow=3)
m
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
class(m)
## [1] "matrix" "array"
typeof(m)
## [1] "double"

Creating a matrix from a vector of numbers

m<-matrix(1:10, ncol=2, nrow=5)
m
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10

2.4.1 Attributes

Names of each dimension

colnames(m)<-letters[1:2]
rownames(m)<-LETTERS[1:5]
m
##   a  b
## A 1  6
## B 2  7
## C 3  8
## D 4  9
## E 5 10
str(m)
##  int [1:5, 1:2] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:5] "A" "B" "C" "D" ...
##   ..$ : chr [1:2] "a" "b"

2.4.2 Built-in functions

To know the size of the matrix

dim(m)
## [1] 5 2
ncol(m)
## [1] 2
nrow(m)
## [1] 5
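Matrices are indexed with two positions, [row, column], and either one can be left empty to select everything along that dimension. A minimal sketch using the same matrix with its row and column names:

```r
m <- matrix(1:10, ncol = 2, nrow = 5,
            dimnames = list(LETTERS[1:5], letters[1:2]))

m[2, ]              # Second row
m[, "b"]            # Column "b"
m["C", "b"]         # 8: a single element selected by names
m[m[, "a"] > 3, ]   # Rows where column "a" is greater than 3
```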

2.4.2.1 Exercise

What do you think that length(m) will return?

2.5 Data frames

Key points:

  • Columns in data frames are vectors
  • Each column can be of a different data type
  • A data frame is essentially a list of vectors

Creating a data frame using data.frame()

df<-data.frame(numbers=1:10,
               low_letters=letters[1:10],
               logical_values=rep(c(T,F),each=5))
df
##    numbers low_letters logical_values
## 1        1           a           TRUE
## 2        2           b           TRUE
## 3        3           c           TRUE
## 4        4           d           TRUE
## 5        5           e           TRUE
## 6        6           f          FALSE
## 7        7           g          FALSE
## 8        8           h          FALSE
## 9        9           i          FALSE
## 10      10           j          FALSE
class(df)
## [1] "data.frame"
typeof(df)
## [1] "list"
str(df)
## 'data.frame':    10 obs. of  3 variables:
##  $ numbers       : int  1 2 3 4 5 6 7 8 9 10
##  $ low_letters   : chr  "a" "b" "c" "d" ...
##  $ logical_values: logi  TRUE TRUE TRUE TRUE TRUE FALSE ...

Re-naming columns

colnames(df)[2]<-"lowercase"
head(df)
##   numbers lowercase logical_values
## 1       1         a           TRUE
## 2       2         b           TRUE
## 3       3         c           TRUE
## 4       4         d           TRUE
## 5       5         e           TRUE
## 6       6         f          FALSE

2.5.1 Indexing and subsetting

df$numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
df["numbers"]
##    numbers
## 1        1
## 2        2
## 3        3
## 4        4
## 5        5
## 6        6
## 7        7
## 8        8
## 9        9
## 10      10
df[1,]
##   numbers lowercase logical_values
## 1       1         a           TRUE
df[,1]
##  [1]  1  2  3  4  5  6  7  8  9 10
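The different ways of indexing a data frame do not all return the same kind of object. The sketch below checks the class of each result and adds a row filter; it assumes R 4.0 or later, where strings are not converted to factors by default.

```r
df <- data.frame(numbers = 1:10, lowercase = letters[1:10])

class(df$numbers)                      # "integer": a vector
class(df[["numbers"]])                 # "integer": a vector
class(df["numbers"])                   # "data.frame": a one-column data frame
class(df[, "numbers", drop = FALSE])   # "data.frame": drop = FALSE keeps the structure
df[df$numbers > 8, ]                   # Rows that satisfy a condition
df[2, "lowercase"]                     # "b": a single element
```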

2.6 Coercion

Converting between data types with as. functions

x<-1:10
as.list(x)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 2
## 
## [[3]]
## [1] 3
## 
## [[4]]
## [1] 4
## 
## [[5]]
## [1] 5
## 
## [[6]]
## [1] 6
## 
## [[7]]
## [1] 7
## 
## [[8]]
## [1] 8
## 
## [[9]]
## [1] 9
## 
## [[10]]
## [1] 10
l<-list(numbers=1:10,
        lowercase=letters[1:10])
l
## $numbers
##  [1]  1  2  3  4  5  6  7  8  9 10
## 
## $lowercase
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
typeof(l)
## [1] "list"
df<-as.data.frame(l)
df
##    numbers lowercase
## 1        1         a
## 2        2         b
## 3        3         c
## 4        4         d
## 5        5         e
## 6        6         f
## 7        7         g
## 8        8         h
## 9        9         i
## 10      10         j
typeof(df)
## [1] "list"
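Coercion between atomic types does not always succeed. Values that cannot be converted become NA, and mixing types in c() silently converts everything to the most general type. A short sketch:

```r
as.numeric(c("1", "2.5", "three"))      # 1.0 2.5 NA, with a warning
as.character(1:3)                       # "1" "2" "3"
as.logical(c("TRUE", "false", "yes"))   # TRUE FALSE NA
as.integer(3.9)                         # 3: the decimal part is dropped
c(1, "a", TRUE)                         # "1" "a" "TRUE": everything becomes character
```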

2.7 Hands on: Data types

  • Make a matrix with the numbers 1:50, with 5 columns and 10 rows. Did the matrix function fill your matrix by column, or by row, as its default behavior? Once you have figured it out, try to change the default. (hint: read the documentation for matrix).
  • Create a list of length two containing a character vector for each of the data sections: (1) Data types and (2) Data structures. Populate each character vector with the names of the data types and data structures, respectively.
  • There are several subtly different ways to call variables, observations and elements from data frames. Try them all and discuss with your team what they return. (Hint, use the function typeof())
  • Take the list you created in the second task and coerce it into a data frame. Then change the names of the columns to “dataTypes” and “dataStructures”.

3 Basic data manipulation

Learning objectives

  • Learn how to read/write data to/from files with different formats (.tsv, .csv)
  • Become familiar with basic operations on data frames
  • Index and subset data frames using base R functions
  • Manipulate specific data frame columns
  • Join data frames by columns and rows

For this section we will use the package gapminder that we installed earlier.

library(gapminder)
dim(gapminder)
## [1] 1704    6
#View(gapminder)
summary(gapminder$country)
##              Afghanistan                  Albania                  Algeria 
##                       12                       12                       12 
##                   Angola                Argentina                Australia 
##                       12                       12                       12 
##                  Austria                  Bahrain               Bangladesh 
##                       12                       12                       12 
##                  Belgium                    Benin                  Bolivia 
##                       12                       12                       12 
##   Bosnia and Herzegovina                 Botswana                   Brazil 
##                       12                       12                       12 
##                 Bulgaria             Burkina Faso                  Burundi 
##                       12                       12                       12 
##                 Cambodia                 Cameroon                   Canada 
##                       12                       12                       12 
## Central African Republic                     Chad                    Chile 
##                       12                       12                       12 
##                    China                 Colombia                  Comoros 
##                       12                       12                       12 
##         Congo, Dem. Rep.              Congo, Rep.               Costa Rica 
##                       12                       12                       12 
##            Cote d'Ivoire                  Croatia                     Cuba 
##                       12                       12                       12 
##           Czech Republic                  Denmark                 Djibouti 
##                       12                       12                       12 
##       Dominican Republic                  Ecuador                    Egypt 
##                       12                       12                       12 
##              El Salvador        Equatorial Guinea                  Eritrea 
##                       12                       12                       12 
##                 Ethiopia                  Finland                   France 
##                       12                       12                       12 
##                    Gabon                   Gambia                  Germany 
##                       12                       12                       12 
##                    Ghana                   Greece                Guatemala 
##                       12                       12                       12 
##                   Guinea            Guinea-Bissau                    Haiti 
##                       12                       12                       12 
##                 Honduras         Hong Kong, China                  Hungary 
##                       12                       12                       12 
##                  Iceland                    India                Indonesia 
##                       12                       12                       12 
##                     Iran                     Iraq                  Ireland 
##                       12                       12                       12 
##                   Israel                    Italy                  Jamaica 
##                       12                       12                       12 
##                    Japan                   Jordan                    Kenya 
##                       12                       12                       12 
##         Korea, Dem. Rep.              Korea, Rep.                   Kuwait 
##                       12                       12                       12 
##                  Lebanon                  Lesotho                  Liberia 
##                       12                       12                       12 
##                    Libya               Madagascar                   Malawi 
##                       12                       12                       12 
##                 Malaysia                     Mali               Mauritania 
##                       12                       12                       12 
##                Mauritius                   Mexico                 Mongolia 
##                       12                       12                       12 
##               Montenegro                  Morocco               Mozambique 
##                       12                       12                       12 
##                  Myanmar                  Namibia                    Nepal 
##                       12                       12                       12 
##              Netherlands              New Zealand                Nicaragua 
##                       12                       12                       12 
##                    Niger                  Nigeria                   Norway 
##                       12                       12                       12 
##                     Oman                 Pakistan                   Panama 
##                       12                       12                       12 
##                  (Other) 
##                      516

3.1 Reading/writing data

3.1.1 Text files

Writing tables to a file using write.table()

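# The data/ folder must already exist in the working directory
aust <- gapminder[gapminder$country == "Australia",]
# Default settings: character values are quoted and row names are written
write.table(aust,
            file="data/gapminder_australia.csv",
            sep=",")
# Overwrite the same file, this time without quotes or row names
write.table(aust,
            file="data/gapminder_australia.csv",
            sep=",",
            quote=FALSE, 
            row.names=FALSE)
# Same table as a tab-separated file
write.table(aust,
            file="data/gapminder_australia.tsv",
            sep="\t",
            quote=FALSE, 
            row.names=FALSE)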

Other functions to write to a file

africa<-gapminder[gapminder$continent=="Africa",]
write.csv(africa,
          file = "data/gapminder_africa.csv",
          row.names = FALSE)
class(africa$continent)
## [1] "factor"

Reading data from a file

africa<-read.csv("data/gapminder_africa.csv",sep = ",",header = T)
class(africa$continent)
## [1] "character"
africa<-read.table("data/gapminder_africa.csv",sep = ",",header = T,stringsAsFactors = T)
class(africa$continent)
## [1] "factor"
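Tab-separated files can be read in the same way, either with read.table(sep = "\t") or with read.delim(), which uses a tab separator and a header row by default. The sketch below writes a small hypothetical table to a temporary file so that it does not depend on the data folder.

```r
tmp <- tempfile(fileext = ".tsv")   # temporary file, removed when the R session ends
toy <- data.frame(id = 1:3,
                  arm = c("placebo", "drug", "drug"))   # hypothetical data

write.table(toy, file = tmp, sep = "\t", quote = FALSE, row.names = FALSE)

toy_back <- read.delim(tmp)   # Defaults: sep = "\t" and header = TRUE
str(toy_back)
```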

3.1.2 R objects

Using .RDS files

saveRDS(africa,file = "objects/africa.RDS")
africa<-readRDS(file = "objects/africa.RDS")

Using .RData files

americas<-gapminder[gapminder$continent=="Americas",]
save(africa,americas,file = "objects/continents.RData")
load(file = "objects/continents.RData",verbose = T)
## Loading objects:
##   africa
##   americas

3.2 Exploring data frames

3.2.1 Adding columns and rows

Individually adding columns

mean_children <- sample(1:10,nrow(aust),replace = TRUE)
aust$mean_children <- mean_children
head(aust)
## # A tibble: 6 × 7
##   country   continent  year lifeExp      pop gdpPercap mean_children
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>         <int>
## 1 Australia Oceania    1952    69.1  8691212    10040.             3
## 2 Australia Oceania    1957    70.3  9712569    10950.             5
## 3 Australia Oceania    1962    70.9 10794968    12217.             4
## 4 Australia Oceania    1967    71.1 11872264    14526.             5
## 5 Australia Oceania    1972    71.9 13177000    16789.             3
## 6 Australia Oceania    1977    73.5 14074100    18334.             3
mean_bikes <- sample(1:4,nrow(aust),replace = TRUE) # Check what happens if they don't have the same number of rows
aust[,"mean_bikes"]<-mean_bikes
head(aust)
## # A tibble: 6 × 8
##   country   continent  year lifeExp      pop gdpPercap mean_children mean_bikes
##   <fct>     <fct>     <int>   <dbl>    <int>     <dbl>         <int>      <int>
## 1 Australia Oceania    1952    69.1  8691212    10040.             3          4
## 2 Australia Oceania    1957    70.3  9712569    10950.             5          3
## 3 Australia Oceania    1962    70.9 10794968    12217.             4          2
## 4 Australia Oceania    1967    71.1 11872264    14526.             5          1
## 5 Australia Oceania    1972    71.9 13177000    16789.             3          2
## 6 Australia Oceania    1977    73.5 14074100    18334.             3          3

Combining data frames

aust <- gapminder[gapminder$country=="Australia",]
df <- data.frame(mean_children=sample(1:10,nrow(aust),replace = TRUE),
               mean_bikes=sample(1:4,nrow(aust),replace = TRUE))
head(df)
##   mean_children mean_bikes
## 1             8          3
## 2             7          4
## 3             1          2
## 4             2          1
## 5             4          3
## 6             7          3
aust <- cbind(aust,df)
head(aust)
##     country continent year lifeExp      pop gdpPercap mean_children mean_bikes
## 1 Australia   Oceania 1952   69.12  8691212  10039.60             8          3
## 2 Australia   Oceania 1957   70.33  9712569  10949.65             7          4
## 3 Australia   Oceania 1962   70.93 10794968  12217.23             1          2
## 4 Australia   Oceania 1967   71.10 11872264  14526.12             2          1
## 5 Australia   Oceania 1972   71.93 13177000  16788.63             4          3
## 6 Australia   Oceania 1977   73.49 14074100  18334.20             7          3
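cbind() pastes the columns side by side and therefore assumes that both data frames have their rows in the same order. When the two tables share a key column, the base R function merge() matches the rows by that key instead. A minimal sketch with two hypothetical tables:

```r
# Hypothetical tables that share the key column "id"
patients <- data.frame(id = c(1, 2, 3), age = c(34, 58, 41))
outcomes <- data.frame(id = c(3, 1, 4), improved = c(TRUE, FALSE, TRUE))

merge(patients, outcomes, by = "id")                 # Only ids present in both tables
merge(patients, outcomes, by = "id", all.x = TRUE)   # All patients, NA where no outcome exists
```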

Individually adding rows

new_row<-list("country" = "Australia",
               "continent" = "Oceania",
               "year" = 2022,
               "lifeExp" = mean(aust$lifeExp),
               "pop" = mean(aust$pop),
               "gdpPercap" = mean(aust$gdpPercap),
               "mean_children" = floor(mean(aust$mean_children)),
               "mean_bikes" = floor(mean(aust$mean_children))) # Why did I create it as list? 
new_row
## $country
## [1] "Australia"
## 
## $continent
## [1] "Oceania"
## 
## $year
## [1] 2022
## 
## $lifeExp
## [1] 74.66292
## 
## $pop
## [1] 14649312
## 
## $gdpPercap
## [1] 19980.6
## 
## $mean_children
## [1] 4
## 
## $mean_bikes
## [1] 4
aust<-rbind(aust,new_row)
tail(aust)
##      country continent year  lifeExp      pop gdpPercap mean_children
## 8  Australia   Oceania 1987 76.32000 16257249  21888.89             5
## 9  Australia   Oceania 1992 77.56000 17481977  23424.77             3
## 10 Australia   Oceania 1997 78.83000 18565243  26997.94             2
## 11 Australia   Oceania 2002 80.37000 19546792  30687.75             8
## 12 Australia   Oceania 2007 81.23500 20434176  34435.37             7
## 13 Australia   Oceania 2022 74.66292 14649312  19980.60             4
##    mean_bikes
## 8           2
## 9           2
## 10          2
## 11          3
## 12          2
## 13          4

Combining data frames by rows

dim(aust)
## [1] 13  8
aust_double<-rbind(aust,aust)
dim(aust_double)
## [1] 26  8

3.2.2 Removing columns and rows

aust<-aust[,-ncol(aust)]# remove the last column
head(aust)
##     country continent year lifeExp      pop gdpPercap mean_children
## 1 Australia   Oceania 1952   69.12  8691212  10039.60             8
## 2 Australia   Oceania 1957   70.33  9712569  10949.65             7
## 3 Australia   Oceania 1962   70.93 10794968  12217.23             1
## 4 Australia   Oceania 1967   71.10 11872264  14526.12             2
## 5 Australia   Oceania 1972   71.93 13177000  16788.63             4
## 6 Australia   Oceania 1977   73.49 14074100  18334.20             7
aust<-aust[,colnames(aust)!="mean_children"]# remove column by name
head(aust)
##     country continent year lifeExp      pop gdpPercap
## 1 Australia   Oceania 1952   69.12  8691212  10039.60
## 2 Australia   Oceania 1957   70.33  9712569  10949.65
## 3 Australia   Oceania 1962   70.93 10794968  12217.23
## 4 Australia   Oceania 1967   71.10 11872264  14526.12
## 5 Australia   Oceania 1972   71.93 13177000  16788.63
## 6 Australia   Oceania 1977   73.49 14074100  18334.20
dim(aust[-1,]) # Remove the first row
## [1] 12  6
dim(aust[-(1:10),]) # Remove the first 10 rows
## [1] 3 6

3.2.3 Applying filters

aust[aust$lifeExp>=70,] 
##      country continent year  lifeExp      pop gdpPercap
## 2  Australia   Oceania 1957 70.33000  9712569  10949.65
## 3  Australia   Oceania 1962 70.93000 10794968  12217.23
## 4  Australia   Oceania 1967 71.10000 11872264  14526.12
## 5  Australia   Oceania 1972 71.93000 13177000  16788.63
## 6  Australia   Oceania 1977 73.49000 14074100  18334.20
## 7  Australia   Oceania 1982 74.74000 15184200  19477.01
## 8  Australia   Oceania 1987 76.32000 16257249  21888.89
## 9  Australia   Oceania 1992 77.56000 17481977  23424.77
## 10 Australia   Oceania 1997 78.83000 18565243  26997.94
## 11 Australia   Oceania 2002 80.37000 19546792  30687.75
## 12 Australia   Oceania 2007 81.23500 20434176  34435.37
## 13 Australia   Oceania 2022 74.66292 14649312  19980.60
aust[aust$gdpPercap>=mean(aust$gdpPercap),] 
##      country continent year  lifeExp      pop gdpPercap
## 8  Australia   Oceania 1987 76.32000 16257249  21888.89
## 9  Australia   Oceania 1992 77.56000 17481977  23424.77
## 10 Australia   Oceania 1997 78.83000 18565243  26997.94
## 11 Australia   Oceania 2002 80.37000 19546792  30687.75
## 12 Australia   Oceania 2007 81.23500 20434176  34435.37
## 13 Australia   Oceania 2022 74.66292 14649312  19980.60
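Several conditions can be combined in one filter with & and |, and columns can be selected in the same step. The sketch below uses a small hypothetical data frame called trial rather than the gapminder data; subset() is a base R alternative with a more compact syntax.

```r
# Hypothetical trial data
trial <- data.frame(id  = 1:6,
                    age = c(25, 47, 62, 38, 71, 55),
                    arm = c("drug", "placebo", "drug", "drug", "placebo", "drug"))

trial[trial$age >= 40 & trial$arm == "drug", ]           # Both conditions must hold
trial[trial$age < 30 | trial$age > 70, c("id", "age")]   # Either condition, two columns only
subset(trial, age >= 40 & arm == "drug", select = c(id, age))
```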

How to get unique entries/remove duplicates

unique(aust_double)
##      country continent year  lifeExp      pop gdpPercap mean_children
## 1  Australia   Oceania 1952 69.12000  8691212  10039.60             8
## 2  Australia   Oceania 1957 70.33000  9712569  10949.65             7
## 3  Australia   Oceania 1962 70.93000 10794968  12217.23             1
## 4  Australia   Oceania 1967 71.10000 11872264  14526.12             2
## 5  Australia   Oceania 1972 71.93000 13177000  16788.63             4
## 6  Australia   Oceania 1977 73.49000 14074100  18334.20             7
## 7  Australia   Oceania 1982 74.74000 15184200  19477.01             4
## 8  Australia   Oceania 1987 76.32000 16257249  21888.89             5
## 9  Australia   Oceania 1992 77.56000 17481977  23424.77             3
## 10 Australia   Oceania 1997 78.83000 18565243  26997.94             2
## 11 Australia   Oceania 2002 80.37000 19546792  30687.75             8
## 12 Australia   Oceania 2007 81.23500 20434176  34435.37             7
## 13 Australia   Oceania 2022 74.66292 14649312  19980.60             4
##    mean_bikes
## 1           3
## 2           4
## 3           2
## 4           1
## 5           3
## 6           3
## 7           3
## 8           2
## 9           2
## 10          2
## 11          3
## 12          2
## 13          4

To remove empty rows

# First let's add an empty row
na.list<-rep(NA,ncol(aust))
aust<-rbind(aust,na.list)
tail(aust)
##      country continent year  lifeExp      pop gdpPercap
## 9  Australia   Oceania 1992 77.56000 17481977  23424.77
## 10 Australia   Oceania 1997 78.83000 18565243  26997.94
## 11 Australia   Oceania 2002 80.37000 19546792  30687.75
## 12 Australia   Oceania 2007 81.23500 20434176  34435.37
## 13 Australia   Oceania 2022 74.66292 14649312  19980.60
## 14      <NA>      <NA>   NA       NA       NA        NA
aust<-aust[!is.na(aust$country),]
tail(aust)
##      country continent year  lifeExp      pop gdpPercap
## 8  Australia   Oceania 1987 76.32000 16257249  21888.89
## 9  Australia   Oceania 1992 77.56000 17481977  23424.77
## 10 Australia   Oceania 1997 78.83000 18565243  26997.94
## 11 Australia   Oceania 2002 80.37000 19546792  30687.75
## 12 Australia   Oceania 2007 81.23500 20434176  34435.37
## 13 Australia   Oceania 2022 74.66292 14649312  19980.60
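The filter above only looks at the country column. To decide based on all columns, one possible approach is to count the missing values per row. The sketch below uses a small hypothetical data frame.

```r
# Hypothetical data frame with missing values
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c("x", NA, NA, "w"))

all_missing <- rowSums(is.na(df)) == ncol(df)   # TRUE for rows where every column is NA
df[!all_missing, ]         # Drops only the completely empty row (row 2)
df[complete.cases(df), ]   # Keeps only rows without any NA (row 1)
```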

3.2.4 Editing specific elements

aust[1,"lifeExp"]<-aust[1,"lifeExp"]+1 

3.3 Hands-on: basic data manipulation

  1. Write a snippet that saves only the data points collected after 1995 in Asian countries to a CSV file.
  2. Separate the gapminder data frame into 5 individual data frames, one for each continent. Store those 5 data frames as an RData file called continents.RData in the objects folder.
  3. Finish exploring the gapminder data frame and:
     • Find the number of rows and the number of columns
     • Print the data type of each column
     • Explain the meaning of everything that str(gapminder) prints
  4. In which years has the GDP of Canada been larger than the average of all data points recorded for Canada?
  5. Find the mean life expectancy of Switzerland before and after 2000
  6. You discovered that all the entries from 2007 are actually from 2008. Create a copy of the full gapminder data frame in an object called gp. Then change the year column to correct the entries from 2007.
  7. Bonus - Find the mean life expectancy and mean GDP per continent using the function tapply

4 Advanced data manipulation

Learning objectives

  • Become familiar with the dplyr syntax
  • Create pipes with the operator %>%
  • Perform operations on data frames using dplyr and tidyr functions
  • Implement functions from other external packages

Several packages make more sophisticated processing operations faster to write and to run. We will look at some functions from one of them, dplyr. I encourage you to look into plyr and tidyr after this workshop.

4.1 Manipulation with dplyr

We often need to select certain observations (rows) or variables (columns), or group the data by certain variable(s) to calculate summary statistics. Although these operations can be done with base R functions, they require creating multiple intermediate objects and repeating a lot of code. Two packages, dplyr and tidyr, provide functions that streamline common operations on tabular data and make the code cleaner and easier to read.

These packages are part of a broader family called the tidyverse. For more information, visit https://www.tidyverse.org/.

We will cover 5 of the most commonly used functions and combine them using pipes (%>%):

  1. select() - used to extract data
  2. filter() - to filter entries using logical vectors
  3. group_by() - to solve the split-apply-combine problem
  4. summarize() - to obtain summary statistics
  5. mutate() - to create new columns

library(tidyr)

4.1.1 Introducing pipes

gapminder %>%
  head()
## # A tibble: 6 × 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
gapminder %>%
  tail()
## # A tibble: 6 × 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1982    60.4  7636524      789.
## 2 Zimbabwe Africa     1987    62.4  9216418      706.
## 3 Zimbabwe Africa     1992    60.4 10704340      693.
## 4 Zimbabwe Africa     1997    46.8 11404948      792.
## 5 Zimbabwe Africa     2002    40.0 11926563      672.
## 6 Zimbabwe Africa     2007    43.5 12311143      470.
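
The pipe passes the object on its left as the first argument of the function on its right, so x %>% f(y) is another way of writing f(x, y). The sketch below compares a nested call with its piped equivalent. It assumes the dplyr and gapminder packages installed earlier, and the object names nested and piped are only illustrative.

```r
library(dplyr)      # makes the %>% operator available
library(gapminder)  # provides the gapminder data frame

# Nested call: read from the inside out
nested <- head(dplyr::select(gapminder, country, year), 3)

# Same steps with pipes: read from top to bottom
piped <- gapminder %>%
  dplyr::select(country, year) %>%
  head(3)

# Both versions should give the same result
identical(nested, piped)
```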

4.1.2 Using select()

To subset a data frame

dplyr::select(.data = gapminder, 
       year, country, gdpPercap) %>%
  head()
## # A tibble: 6 × 3
##    year country     gdpPercap
##   <int> <fct>           <dbl>
## 1  1952 Afghanistan      779.
## 2  1957 Afghanistan      821.
## 3  1962 Afghanistan      853.
## 4  1967 Afghanistan      836.
## 5  1972 Afghanistan      740.
## 6  1977 Afghanistan      786.

To remove columns

dplyr::select(.data = gapminder, 
       -continent) %>%
      head()
## # A tibble: 6 × 5
##   country      year lifeExp      pop gdpPercap
##   <fct>       <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan  1952    28.8  8425333      779.
## 2 Afghanistan  1957    30.3  9240934      821.
## 3 Afghanistan  1962    32.0 10267083      853.
## 4 Afghanistan  1967    34.0 11537966      836.
## 5 Afghanistan  1972    36.1 13079460      740.
## 6 Afghanistan  1977    38.4 14880372      786.
# The same kind of column selection, written with a pipe
gapminder %>% 
  dplyr::select(year, country, gdpPercap) %>%
  head()
## # A tibble: 6 × 3
##    year country     gdpPercap
##   <int> <fct>           <dbl>
## 1  1952 Afghanistan      779.
## 2  1957 Afghanistan      821.
## 3  1962 Afghanistan      853.
## 4  1967 Afghanistan      836.
## 5  1972 Afghanistan      740.
## 6  1977 Afghanistan      786.

4.1.3 Using filter()

Include only European countries and select the columns year, country and gdpPercap

gapminder %>%
    dplyr::filter(continent == "Europe") %>%
    dplyr::select(year, country, gdpPercap) %>%
    head()
## # A tibble: 6 × 3
##    year country gdpPercap
##   <int> <fct>       <dbl>
## 1  1952 Albania     1601.
## 2  1957 Albania     1942.
## 3  1962 Albania     2313.
## 4  1967 Albania     2760.
## 5  1972 Albania     3313.
## 6  1977 Albania     3533.

Using multiple filters at once

gapminder %>%
  dplyr::filter(continent == "Europe", year == 2007) %>%
  dplyr::select(country, lifeExp)
## # A tibble: 30 × 2
##    country                lifeExp
##    <fct>                    <dbl>
##  1 Albania                   76.4
##  2 Austria                   79.8
##  3 Belgium                   79.4
##  4 Bosnia and Herzegovina    74.9
##  5 Bulgaria                  73.0
##  6 Croatia                   75.7
##  7 Czech Republic            76.5
##  8 Denmark                   78.3
##  9 Finland                   79.3
## 10 France                    80.7
## # … with 20 more rows

Extract unique entries

gapminder %>%
  dplyr::select(country, continent) %>%
  dplyr::distinct()
## # A tibble: 142 × 2
##    country     continent
##    <fct>       <fct>    
##  1 Afghanistan Asia     
##  2 Albania     Europe   
##  3 Algeria     Africa   
##  4 Angola      Africa   
##  5 Argentina   Americas 
##  6 Australia   Oceania  
##  7 Austria     Europe   
##  8 Bahrain     Asia     
##  9 Bangladesh  Asia     
## 10 Belgium     Europe   
## # … with 132 more rows

Order according to a column

gapminder %>%
  dplyr::select(country, continent,year,pop) %>%
  dplyr::arrange(desc(pop)) %>%
  head()
## # A tibble: 6 × 4
##   country continent  year        pop
##   <fct>   <fct>     <int>      <int>
## 1 China   Asia       2007 1318683096
## 2 China   Asia       2002 1280400000
## 3 China   Asia       1997 1230075000
## 4 China   Asia       1992 1164970000
## 5 India   Asia       2007 1110396331
## 6 China   Asia       1987 1084035000

4.1.4 Using group_by()

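group_by() does not change the data itself. It stores the grouping internally, as an attribute, so that later operations are applied to each group defined by the specified variable(s). Compare the output of str() before and after grouping: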

str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
str(gapminder %>% dplyr::group_by(continent))
## grouped_df [1,704 × 6] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
##  - attr(*, "groups")= tibble [5 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##   ..$ .rows    : list<int> [1:5] 
##   .. ..$ : int [1:624] 25 26 27 28 29 30 31 32 33 34 ...
##   .. ..$ : int [1:300] 49 50 51 52 53 54 55 56 57 58 ...
##   .. ..$ : int [1:396] 1 2 3 4 5 6 7 8 9 10 ...
##   .. ..$ : int [1:360] 13 14 15 16 17 18 19 20 21 22 ...
##   .. ..$ : int [1:24] 61 62 63 64 65 66 67 68 69 70 ...
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
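
Grouping becomes visible only once another operation is applied. As a small sketch on made-up data (the scores data frame and its columns are hypothetical), the same mutate() call gives different results with and without group_by():

```r
library(dplyr)

# Toy data (hypothetical values)
scores <- data.frame(team = c("a", "a", "b", "b"),
                     points = c(1, 3, 10, 20))

# Without grouping, mean() uses all four rows
ungrouped <- scores %>%
  dplyr::mutate(mean_points = mean(points))

# With grouping, mean() is computed separately for each team
grouped <- scores %>%
  dplyr::group_by(team) %>%
  dplyr::mutate(mean_points = mean(points)) %>%
  dplyr::ungroup()

ungrouped$mean_points
## [1] 8.5 8.5 8.5 8.5
grouped$mean_points
## [1]  2  2 15 15
```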

4.1.5 Using summarize()

gdp_c <- gapminder %>%
          dplyr::group_by(continent) %>%
          dplyr::summarize(mean_gdpPercap = mean(gdpPercap))
gdp_c
## # A tibble: 5 × 2
##   continent mean_gdpPercap
##   <fct>              <dbl>
## 1 Africa             2194.
## 2 Americas           7136.
## 3 Asia               7902.
## 4 Europe            14469.
## 5 Oceania           18622.

Combine multiple summary statistics

gapminder %>%
    dplyr::group_by(continent) %>%
    dplyr::summarize(mean_le = mean(lifeExp),
                      min_le = min(lifeExp),
                      max_le = max(lifeExp),
                      se_le = sd(lifeExp)/sqrt(dplyr::n()))
## # A tibble: 5 × 5
##   continent mean_le min_le max_le se_le
##   <fct>       <dbl>  <dbl>  <dbl> <dbl>
## 1 Africa       48.9   23.6   76.4 0.366
## 2 Americas     64.7   37.6   80.7 0.540
## 3 Asia         60.1   28.8   82.6 0.596
## 4 Europe       71.9   43.6   81.8 0.286
## 5 Oceania      74.3   69.1   81.2 0.775

4.1.6 Using mutate()

gapminder %>%
  dplyr::mutate(gdp_billion = gdpPercap*pop/10^9)
## # A tibble: 1,704 × 7
##    country     continent  year lifeExp      pop gdpPercap gdp_billion
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>       <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.        6.57
##  2 Afghanistan Asia       1957    30.3  9240934      821.        7.59
##  3 Afghanistan Asia       1962    32.0 10267083      853.        8.76
##  4 Afghanistan Asia       1967    34.0 11537966      836.        9.65
##  5 Afghanistan Asia       1972    36.1 13079460      740.        9.68
##  6 Afghanistan Asia       1977    38.4 14880372      786.       11.7 
##  7 Afghanistan Asia       1982    39.9 12881816      978.       12.6 
##  8 Afghanistan Asia       1987    40.8 13867957      852.       11.8 
##  9 Afghanistan Asia       1992    41.7 16317921      649.       10.6 
## 10 Afghanistan Asia       1997    41.8 22227415      635.       14.1 
## # … with 1,694 more rows

4.1.7 Putting them all together

gdp_pop_ext <- gapminder %>%
                dplyr::mutate(gdp_billion = gdpPercap*pop/10^9) %>%
                dplyr::group_by(continent,year) %>%
                dplyr::summarize(mean_gdpPercap = mean(gdpPercap),
                                 sd_gdpPercap = sd(gdpPercap),
                                 mean_pop = mean(pop),
                                 sd_pop = sd(pop),
                                 mean_gdp_billion = mean(gdp_billion),
                                 sd_gdp_billion = sd(gdp_billion)) 
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
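
The message above appears because summarize() removes only the last grouping level, so the result is still grouped by continent. One way to make the choice explicit is the .groups argument. The sketch below is one option, not the only one, and the object name gdp_by_cont_year is illustrative.

```r
library(dplyr)
library(gapminder)

# .groups = "drop" returns an ungrouped tibble and silences the message
gdp_by_cont_year <- gapminder %>%
  dplyr::group_by(continent, year) %>%
  dplyr::summarize(mean_gdpPercap = mean(gdpPercap),
                   .groups = "drop")

# Check that no grouping is left
dplyr::is_grouped_df(gdp_by_cont_year)
## [1] FALSE
```

Dropping the groups avoids surprises when later steps, such as mutate(), would otherwise be computed per continent.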

4.2 Hands-on advanced data manipulation

  1. Write one command (it can span multiple lines) using pipes that outputs a data frame with only the columns lifeExp, country and year, for the records before the year 2000 from African countries only.
  2. Calculate the average life expectancy per country. Which country has the longest average life expectancy and which one the shortest average life expectancy?
  3. In the previous hands-on you discovered that all the entries from 2007 are actually from 2008. Write a command to edit the data accordingly using pipes. In the same command filter only the entries from 2008 to verify the change.

5 Generating visual outputs

5.1 Graphics with base R

hist(gapminder$lifeExp,xlab="Life expectancy")

Arrange figures into multiple panels with par

df<-gapminder[gapminder$country=="Switzerland",]
par(mfrow=c(1,3))
plot(y = df$lifeExp,x=df$year,xlab="Years",ylab="Life expectancy")
plot(y = df$pop,x=df$year,xlab="Years",ylab="Population size")
plot(y = df$gdpPercap,x=df$year,xlab="Years",ylab="GDP per capita")

df<-gapminder[gapminder$country=="Zimbabwe",]
par(mfrow=c(1,3))
plot(y = df$lifeExp,x=df$year,xlab="Years",ylab="Life expectancy")
plot(y = df$pop,x=df$year,xlab="Years",ylab="Population size")
plot(y = df$gdpPercap,x=df$year,xlab="Years",ylab="GDP per capita")

5.2 Graphics with ggplot2

library(ggplot2)

We can look at multiple countries at the same time in a prettier way

df<-gapminder %>%
      dplyr::mutate(country = as.character(country)) %>%
      dplyr::filter(country %in% c("Switzerland","Australia","Zimbabwe","India"))
      
ggplot(df,aes(x=year,y=lifeExp,color=country))+
  geom_point()+
  geom_line()

ggplot(df,aes(x=year,y=gdpPercap,color=country))+
  geom_point()+
  geom_line()

Now, let’s plot the mean GDP per capita over time for each continent

gdp_c <- gapminder %>%
          dplyr::group_by(continent,year) %>%
          dplyr::summarize(mean_gdpPercap = mean(gdpPercap),
                           mean_le = mean(lifeExp),
                           min_le = min(lifeExp),
                           max_le = max(lifeExp),
                           se_le = sd(lifeExp)/sqrt(dplyr::n()))
## `summarise()` has grouped output by 'continent'. You can override using the
## `.groups` argument.
head(gdp_c)
## # A tibble: 6 × 7
## # Groups:   continent [1]
##   continent  year mean_gdpPercap mean_le min_le max_le se_le
##   <fct>     <int>          <dbl>   <dbl>  <dbl>  <dbl> <dbl>
## 1 Africa     1952          1253.    39.1   30     52.7 0.714
## 2 Africa     1957          1385.    41.3   31.6   58.1 0.779
## 3 Africa     1962          1598.    43.3   32.8   60.2 0.815
## 4 Africa     1967          2050.    45.3   34.1   61.6 0.844
## 5 Africa     1972          2340.    47.5   35.4   64.3 0.890
## 6 Africa     1977          2586.    49.6   36.8   67.1 0.944
ggplot(gdp_c,aes(x=year,y=mean_gdpPercap,color=continent))+
  geom_point()+
  geom_line()

We can pipe objects directly into the ggplot() function:

gdp_c %>% 
  ggplot(aes(x=year,y=mean_gdpPercap,color=continent))+
    geom_point()+
    geom_line()

And even do this:

gapminder %>%
  dplyr::group_by(continent,year) %>%
  dplyr::summarize(mean_gdpPercap = mean(gdpPercap)) %>%
  ggplot(aes(x=year,y=mean_gdpPercap,color=continent))+
    geom_point()+
    geom_line()

5.2.0.1 Exercise

Plot the life expectancy over time of all countries for the years with population size larger than 2e+06

gapminder %>%
  dplyr::filter(pop > 2e+06) %>%
  ggplot(aes(x=year,y=lifeExp,color=country))+
    geom_point()+
    geom_line()+
    facet_wrap(~continent)+
    theme(legend.position = "none")

5.2.1 Some ggplot tricks

Make sure your data is in the right format (wide vs long). Usually, ggplot requires the data in long format. The functions tidyr::pivot_wider() and tidyr::pivot_longer() are very useful to transform one into the other.

?tidyr::pivot_wider
?tidyr::pivot_longer
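
As a minimal sketch of the two formats, the example below converts a made-up wide table into long format and back. The data frame wide, its columns y1997 and y2002, and the values are hypothetical.

```r
library(tidyr)

# Toy data in wide format (hypothetical values): one column per year
wide <- data.frame(country = c("A", "B"),
                   y1997 = c(1.5, 2.5),
                   y2002 = c(1.8, 2.9))

# Wide to long: one row per country-year combination
long <- tidyr::pivot_longer(wide,
                            cols = -country,
                            names_to = "year",
                            values_to = "value")

# Long back to wide: one column per year again
wide_again <- tidyr::pivot_wider(long,
                                 names_from = year,
                                 values_from = value)

dim(long)
## [1] 4 3
dim(wide_again)
## [1] 2 3
```

The long version is the one that maps directly onto aes(), with year on one axis and value on the other.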

To change the order of colors, modify the factor levels

gapminder %>%
  # Set the factor levels before grouping
  dplyr::mutate(continent = factor(as.character(continent),
                                   levels = c("Oceania","Europe","Africa","Americas","Asia"))) %>%
  dplyr::group_by(continent,year) %>%
  dplyr::summarize(mean_gdpPercap = mean(gdpPercap)) %>%
  ggplot(aes(x=year,y=mean_gdpPercap,color=continent))+
    geom_point()+
    geom_line()

You can store the plots in an object and keep adding layers to it

p<-gapminder %>%
    # Set the factor levels before grouping
    dplyr::mutate(continent = factor(as.character(continent),
                                     levels = c("Oceania","Europe","Africa","Americas","Asia"))) %>%
    dplyr::group_by(continent,year) %>%
    dplyr::summarize(mean_gdpPercap = mean(gdpPercap)) %>%
    ggplot(aes(x=year,y=mean_gdpPercap,color=continent))+
      geom_point()+
      geom_line()

# Change the color palette
p + scale_color_viridis_d(begin = 0.1,end=0.8)

6 Real life application

  1. How many clinics participated in the study, and how many valid tests were performed on each one? Did the testing trend vary over time?
  2. How many patients tested positive vs negative in the first 100 days of the pandemic? Do you notice any difference with the age of the patients? Hint: You can make two age groups and calculate the percentage of each age group in positive vs negative tests. Try using the function ifelse() to do this.
  3. Look at the specimen processing time to receipt. Did the sample processing times improve over the first 100 days of the pandemic? Plot the median processing times of each day over the course of the pandemic, and then compare the summary statistics of the first 50 vs the last 50 days.
  4. Bonus: Higher viral loads are detected in fewer PCR cycles. What can you observe about the viral load of positive vs negative samples? Do you notice any differences in viral load across ages in the positive samples? Hint: Also split the data into two age groups and try using geom_boxplot()

library(medicaldata)
covid<-covid_testing
dim(covid)
## [1] 15524    17
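
The ifelse() hint can be sketched on a toy data frame before applying it to the real data. Everything in this example is an assumption made for illustration: the data frame tests, its column names, its values, and the cut-off of 18 years.

```r
library(dplyr)

# Toy data (hypothetical values and column names)
tests <- data.frame(age = c(12, 35, 67, 8, 51),
                    result = c("negative", "positive", "positive",
                               "negative", "negative"))

# ifelse(condition, value_if_true, value_if_false) works element by element
counts <- tests %>%
  dplyr::mutate(age_group = ifelse(age < 18, "under 18", "18 or older")) %>%
  dplyr::group_by(age_group, result) %>%
  dplyr::summarize(n = dplyr::n(), .groups = "drop")

counts
```

The same pattern should carry over to covid once you have checked its column names with str(covid).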

7 Software development concepts

7.1 Good coding practices

7.1.1 Script structure

  • Use comments to create sections.
  • Load all required packages at the very beginning.
  • Write all function definitions after the package-loading section, or create a standalone file for your functions and load it from the main code with source().
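
A possible skeleton that follows these three points is sketched below. The section labels, the helper StandardError(), and the file name functions.R are only illustrative.

```r
# ---- Packages ----
library(dplyr)

# ---- Functions ----
# Hypothetical helper, defined right after the packages are loaded
StandardError <- function(x) {
  return(sd(x) / sqrt(length(x)))
}

# Alternative: keep the functions in their own file and load them here
# source("functions.R")

# ---- Analysis ----
se_by_species <- iris %>%
  dplyr::group_by(Species) %>%
  dplyr::summarize(se_sepal_length = StandardError(Sepal.Length))

se_by_species
```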

7.1.2 Functions

Name functions by capitalizing the first letter of each word

# Good
DoNothing <- function() {
  return(invisible(NULL))
}

# Bad
donothing <- function() {
  return(invisible(NULL))
}

Use explicit returns

# Good
AddValues <- function(x, y) {
  return(x + y)
}

# Bad
AddValues <- function(x, y) {
  x + y
}

Describe what the function does, its input parameters, and its output using comments inside the function

AddValues <- function(x, y) {
  
  # Description: Function to add two numeric variables
  # Input
  # x = numeric
  # y = numeric 
  # Output: numeric
  
  return(x + y)
}

Testing and documenting

  • Use formal documentation for functions whenever you are working on more complicated projects. This documentation is written in separate .Rd files, and it becomes the documentation shown in the help files.
  • The roxygen2 package allows R coders to write documentation alongside the function code and then process it into the appropriate .Rd files.
  • Formal automated tests can be written using the testthat package.
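
As a rough sketch of what this looks like in practice, the AddValues() function from above can carry roxygen2 comments (lines starting with #') and be checked with a small testthat test. The wording of the documentation and the choice of test cases are illustrative.

```r
#' Add two numeric values
#'
#' @param x A numeric vector.
#' @param y A numeric vector.
#' @return The sum of x and y.
#' @examples
#' AddValues(1, 2)
AddValues <- function(x, y) {
  return(x + y)
}

# A formal test: each expectation must hold for the test to pass
library(testthat)

test_that("AddValues adds two numbers", {
  expect_equal(AddValues(1, 2), 3)
  expect_equal(AddValues(-1, 1), 0)
})
```

The #' comments are turned into .Rd files only inside a package, when roxygen2 processes the source. In a plain script they behave like ordinary comments.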

7.1.3 External packages

  • Packages are essentially bundles of functions with formal documentation. Loading your own functions through source("functions.R") is similar to loading someone else’s using library("package").

  • As a general rule, only load a package using library() if you are going to use more than two functions from it.

  • Use the namespace when calling an external function. Not doing so can cause clashes when two packages have a function with the same name.

# Good
purrr::map()

# Bad
map()
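
A clash that beginners often run into is filter(), which exists both in the stats package that ships with R and in dplyr. The sketch below, on made-up data, shows that the two functions do entirely different things, which is why the prefix matters.

```r
# Toy data (hypothetical values)
df <- data.frame(a = 1:5)

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(df, a > 3)

# stats::filter() applies a linear filter to a numeric series
# (here a 3-point moving average)
smoothed <- stats::filter(df$a, rep(1/3, 3))

kept
smoothed
```

Without the prefix, which of the two runs depends on which packages were loaded and in what order.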

7.2 Debugging and troubleshooting

General advice:

  • Create a minimal reproducible example of your error.
  • Whenever you see an error, copy the full message and paste it into the search bar of your web browser. There is a lot of support out there, and most likely someone has come across the same error before.
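
A minimal reproducible example usually has three parts: a tiny data set, the call that fails, and information about your R session. The sketch below uses made-up data to show one way of assembling them.

```r
# A small, self-contained data set that still triggers the problem
# (hypothetical values)
df <- data.frame(group = c("a", "b"),
                 value = c("1", "2"))   # numbers stored as text

# The failing call, wrapped in try() so that the script keeps running
result <- try(sum(df$value), silent = TRUE)
cat(result)

# Information worth sharing along with the question
dput(df)        # prints code that recreates df exactly
sessionInfo()   # R version, platform and loaded packages
```

Pasting the output of dput() into a question lets other people recreate your data without needing your files.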